Machine Learning (ML) in Bioinformatics


Dimensionality Reduction


Prerequisites: None.
Level: Beginner.
Learning objectives:

Introduction

Dimensionality reduction is a technique used to reduce the number of features or dimensions in a dataset while keeping as much information as possible. It is often used in machine learning, data analysis, and data visualization to reduce the data's complexity and make it easier to understand.

There are several methods for dimensionality reduction, including principal component analysis (PCA), linear discriminant analysis (LDA), t-distributed stochastic neighbor embedding (t-SNE), and singular value decomposition (SVD).

Each method has strengths and limitations, and the appropriate choice depends on the characteristics of the data and the goals of the analysis.

In this tutorial, we will explore the basics of dimensionality reduction and how it can be used to improve the performance of machine learning algorithms. We will also look at standard techniques for visualizing high-dimensional data and discuss the trade-offs between information retention and computational efficiency.

Overall, this tutorial aims to provide a comprehensive introduction to dimensionality reduction, including its motivation, the different methods available, and how to apply these methods in practical situations.

Why Use Dimensionality Reduction?

There are several reasons why dimensionality reduction can be helpful in machine learning and data analysis. Some of the main benefits include:

Improved performance:
By reducing the number of dimensions, we can reduce the computational complexity of the learning algorithm and improve its performance.
Reduced overfitting:
With fewer dimensions, there is less risk of overfitting the model.
More straightforward interpretation:
Visualizing and understanding high-dimensional data can be complex, but dimensionality reduction can help to project the data onto a lower-dimensional space where it is easier to understand and interpret.

Types of Dimensionality Reduction

There are several methods for dimensionality reduction, each with its strengths and limitations. Some of the most commonly used methods include:

Principal Component Analysis (PCA):
PCA is a linear dimensionality reduction method that projects the data onto a lower-dimensional space by finding the directions of maximum variance in the data.
Linear Discriminant Analysis (LDA):
LDA is a supervised dimensionality reduction method that projects the data onto a lower-dimensional space by maximizing the separation between different classes.
T-Distributed Stochastic Neighbor Embedding (t-SNE):
t-SNE is a nonlinear dimensionality reduction technique that projects the data onto a lower-dimensional space by preserving the local structure of the data.
Singular Value Decomposition (SVD):
SVD is a matrix decomposition technique that can reduce dimensionality by keeping only the top-ranked singular vectors.

Visualizing High-Dimensional Data

Before applying dimensionality reduction, it is often helpful to visualize the data in its original high-dimensional space. However, visualizing data in more than three dimensions can be challenging, as it is difficult for our brains to interpret and understand.

One way to visualize high-dimensional data is to use scatter plots with different pairs of dimensions. Scatter plots can give us a sense of the relationships between different features and help us identify patterns in the data.

Another option is to use parallel coordinate plots, which allow us to plot all dimensions at once by aligning them along parallel vertical axes.

A third option is to use dimensionality reduction itself: project the data onto a lower-dimensional space and then visualize it there. This kind of projection can be beneficial when working with large datasets or datasets with many dimensions, as it reduces clutter and makes patterns in the data more apparent.
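
As a quick sketch of the first two options (assuming pandas, matplotlib, and scikit-learn are available; the Iris dataset is used only as a convenient stand-in for higher-dimensional data):

          import matplotlib.pyplot as plt
          from pandas.plotting import parallel_coordinates, scatter_matrix
          from sklearn.datasets import load_iris

          # Load a small labeled dataset as a pandas DataFrame
          iris = load_iris(as_frame=True)
          df = iris.frame  # four feature columns plus a 'target' column

          # Pairwise scatter plots for every pair of features
          scatter_matrix(df[iris.feature_names], figsize=(8, 8))

          # Parallel coordinate plot: one vertical axis per feature, colored by class
          parallel_coordinates(df, class_column="target")
          plt.show()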

Trade-Offs in Dimensionality Reduction

As with any technique, dimensionality reduction comes with its own set of trade-offs. One trade-off to consider is the balance between information retention and computational efficiency.

The more dimensions we keep, the more information we retain, but the more significant the computational cost. On the other hand, reducing the number of dimensions can improve computational efficiency, but it also results in the loss of some information.

Another trade-off to consider is the balance between linear and nonlinear techniques. Linear techniques are generally faster and easier to implement, but they may not capture the complex nonlinear relationships in the data. Nonlinear techniques can capture more complex relationships, but they may be more computationally intensive and harder to interpret.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction method that projects the data onto a lower-dimensional space by finding the directions of maximum variance in the data. It is often used in machine learning, data analysis, and data visualization to reduce the complexity of the data and to make it easier to understand.

How does PCA work?

PCA works by identifying the directions in the data (called "principal components") with the highest variance and projecting the data onto these directions.

The direction in which our data varies the most is the first principal component. The second principal component is the direction in which the data varies the second most, and so on.

Let us consider a simple example with two-dimensional data to understand how PCA works. Suppose we have a dataset with two features (x and y) and several samples (points on a scatter plot). The variance of the data along each feature is reflected in the spread of the points along the corresponding axis.

PCA identifies the direction in which the data varies the most (the first principal component) and projects the data onto this direction. If we were to plot this example, the first principal component would be roughly the line of best fit through the data, while the second principal component is orthogonal to the first and corresponds to the direction with the second-largest variation.

After projecting the data onto the first principal component, we obtain a one-dimensional dataset that captures the maximum amount of variance in the original data. We can then use this reduced dataset for further analysis or visualization.

How to compute PCA?

There are several ways to compute PCA, but singular value decomposition (SVD) is the most common method. SVD is a matrix decomposition technique that decomposes a (centered) data matrix X into the product X = U S Vᵀ of three matrices:

  • U: A matrix whose columns are the left singular vectors
  • S: A diagonal matrix of singular values, sorted in descending order
  • V: A matrix whose columns are the right singular vectors (the principal components)

To compute PCA using SVD, we follow these steps:

  1. Center the data: First, we center the data by subtracting the computed mean of each feature from each feature value. We do this to ensure that the data is centered around the origin, which makes it easier to interpret the results.
  2. Compute the SVD of the centered data matrix: We then compute the SVD of the centered data matrix, which decomposes the matrix into the product of three matrices: U, S, and V.
  3. Obtain the principal components: The principal components are given by the columns of the matrix V. The first principal component is the direction in which the data varies the most. The second principal component is the direction in which the data varies the second most, and so on.
  4. Project the data onto the principal components: To project the data onto the principal components, we multiply the centered data matrix by the matrix V. The result is a new matrix with the same number of rows as the original data but fewer columns; the number of columns equals the number of principal components we keep.
  5. Sort the principal components by variance: Finally, the components are ordered by the variance they capture, with the highest-variance components first. When PCA is computed via SVD, the singular values (and hence the components) are already returned in this order.

Note that PCA is sensitive to the scaling of the features, so it is essential to standardize the data before applying PCA.

Standardization involves scaling the features to have zero mean and unit variance, which helps to ensure that the features are on a similar scale and do not dominate the results.
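
As a minimal sketch of standardization (assuming scikit-learn is available; the data array here is synthetic and exists only to illustrate the shapes involved):

          import numpy as np
          from sklearn.preprocessing import StandardScaler

          # Synthetic data: 100 samples, 5 features on very different scales
          rng = np.random.default_rng(42)
          data = rng.normal(loc=[0, 10, 100, 1000, 0.1],
                            scale=[1, 5, 50, 500, 0.01],
                            size=(100, 5))

          # Rescale every feature to zero mean and unit variance
          data = StandardScaler().fit_transform(data)
          print(data.mean(axis=0).round(6), data.std(axis=0).round(6))

The standardized data array can then be passed to the SVD-based PCA code below.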

Here is an example of how to compute PCA using the SVD method in Python (NumPy); it assumes data is an (n_samples, n_features) array, such as the standardized array from the snippet above:

        
          import numpy as np

          # data: an (n_samples, n_features) array, e.g. the standardized array above

          # Center the data by subtracting the mean of each feature
          data_centered = data - np.mean(data, axis=0)

          # Compute the SVD of the centered data matrix.
          # np.linalg.svd returns V transposed (Vt): its rows are the right
          # singular vectors, and the singular values in S are already sorted
          # in descending order.
          U, S, Vt = np.linalg.svd(data_centered, full_matrices=False)

          # Obtain the principal components (one per row of Vt, ordered by variance)
          principal_components = Vt

          # Project the centered data onto the principal components
          data_projection = data_centered @ principal_components.T

          # Variance explained by each component
          explained_variance = (S ** 2) / (data_centered.shape[0] - 1)

This code first centers the data by subtracting the mean of each feature from each feature value. It then computes the SVD of the centered data matrix; note that np.linalg.svd returns V transposed, so the rows of Vt are the principal components, and because the singular values come out in descending order, the components are already sorted by the variance they capture.

Finally, it projects the centered data onto the principal components by multiplying the centered data matrix by the transpose of Vt (that is, by V), and computes the variance explained by each component from the singular values.

Applications of PCA

PCA has a wide range of applications, including:

Data visualization:
By projecting the data onto the first few principal components, we can visualize the data in a lower-dimensional space, which can be helpful for understanding and interpreting the data.
Noise reduction:
By projecting the data onto the principal components with the highest variance, we can reduce the noise in the data and filter out the less important features.
Feature extraction:
By keeping only the top-ranked principal components, we can extract the most important features of the data and use them for further analysis or as input to a machine learning algorithm.
Data compression:
By projecting the data onto a lower-dimensional space, we can reduce the size of the dataset and improve the efficiency of data storage and transmission.
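
As a hedged sketch of the compression and reconstruction idea (assuming scikit-learn is available; the random data here is purely illustrative):

          import numpy as np
          from sklearn.decomposition import PCA

          rng = np.random.default_rng(0)
          X = rng.normal(size=(100, 10))

          # Keep only the top 3 components, then reconstruct an approximation of X
          pca = PCA(n_components=3).fit(X)
          X_compressed = pca.transform(X)                         # shape (100, 3)
          X_reconstructed = pca.inverse_transform(X_compressed)   # back to shape (100, 10)

          reconstruction_error = np.mean((X - X_reconstructed) ** 2)
          print(pca.explained_variance_ratio_.sum(), reconstruction_error)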

Limitations of PCA

PCA has several limitations, including:

  • PCA is a linear technique, which means it may not capture the complex nonlinear relationships in the data.
  • PCA is susceptible to the scaling of the features, so it is essential to standardize the data before applying PCA.
  • PCA is an unsupervised technique, so it does not use class labels; the directions of maximum variance are not necessarily the directions that best separate different classes or groups in the data.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction method that projects the data onto a lower-dimensional space by maximizing the separation between different classes. It is often used in machine learning and pattern recognition to improve the performance of classification algorithms.

How does LDA work?

LDA works by finding the directions in the data that maximize the separation between different classes. It finds the directions (called "linear discriminants") with the highest between-class and lowest within-class variance.

Let us consider a simple example with two-dimensional data to understand how LDA works. Suppose we have a dataset with two features (x and y) and two classes (shown as red and blue points on a scatter plot). The variance of the data along each feature is reflected in the spread of the points along the corresponding axis.

PLOT HERE

LDA identifies the direction along which the two classes are best separated (the first linear discriminant) and projects the data onto this direction.

If we were to plot this example, the first linear discriminant would be the direction along which the projected class means are farthest apart relative to the spread within each class. With only two classes there is a single linear discriminant; with C classes, up to C − 1 discriminants can be extracted, each capturing the next-largest amount of between-class separation.

After projecting the data onto the linear discriminant(s), we obtain a lower-dimensional dataset (one-dimensional in this two-class example) that captures the maximum separation between the classes. We can then use this reduced dataset for further analysis or visualization, or as input to a classification algorithm.

How to compute LDA?

There are several ways to compute LDA, but the most common method is solving the generalized eigenvalue problem. To compute LDA using this method, we follow these steps:

  1. Compute the mean vectors for each class: We compute the mean vector for each class, which is a D-dimensional vector where D is the number of features. The mean vector for each class gives us the average feature value for that class.
  2. Compute the within-class scatter matrix: We then compute the within-class scatter matrix, a D x D matrix representing the spread of the data within each class.
  3. Compute the between-class scatter matrix: We also compute the between-class scatter matrix, a D x D matrix representing the separation between the different classes.
  4. Solve the generalized eigenvalue problem: Finally, we solve the generalized eigenvalue problem to find the linear discriminants (the eigenvectors) that maximize the separation between the classes. The eigenvectors are ranked by their corresponding eigenvalues, with the highest-ranked eigenvectors representing the most critical linear discriminants.
  5. Project the data onto the linear discriminants: To project the data onto the linear discriminants, we multiply the data matrix by the matrix of linear discriminants. This results in a new matrix with the same number of rows as the original data but with a reduced number of columns (the number of columns equals the number of linear discriminants).

Here is a minimal sketch of how to compute LDA in Python with NumPy, following the steps above (the generalized eigenvalue problem is solved here via the pseudo-inverse of the within-class scatter matrix):

        
          import numpy as np

          def compute_lda(X, y, num_components):
              """Project X onto its top num_components linear discriminants."""
              n_features = X.shape[1]
              classes = np.unique(y)
              overall_mean = X.mean(axis=0)

              # Steps 1-3: class means, within-class and between-class scatter matrices
              S_w = np.zeros((n_features, n_features))
              S_b = np.zeros((n_features, n_features))
              for c in classes:
                  X_c = X[y == c]
                  mean_c = X_c.mean(axis=0)
                  # Spread of the samples of class c around their own mean
                  S_w += (X_c - mean_c).T @ (X_c - mean_c)
                  # Separation of the class mean from the overall mean,
                  # weighted by the number of samples in the class
                  diff = (mean_c - overall_mean).reshape(-1, 1)
                  S_b += X_c.shape[0] * (diff @ diff.T)

              # Step 4: solve the generalized eigenvalue problem S_b w = lambda * S_w w
              # (equivalent to the ordinary eigenproblem of pinv(S_w) @ S_b)
              eigenvalues, eigenvectors = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)

              # Sort the eigenvectors by their eigenvalues in descending order
              order = np.argsort(eigenvalues.real)[::-1]
              eigenvectors = eigenvectors[:, order].real

              # Step 5: keep the top discriminants and project the data onto them
              W = eigenvectors[:, :num_components]
              return X @ W, W

          # Example: two classes of 2-D points; at most one discriminant exists here
          X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                        [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
          y = np.array([0, 0, 0, 1, 1, 1])
          X_lda, W = compute_lda(X, y, num_components=1)
          print(X_lda.shape)   # (6, 1)

Applications of LDA

LDA has a wide range of applications, including:

Classification:
By projecting the data onto the linear discriminants, we can reduce the dimensionality of the data and improve the performance of classification algorithms.
Data visualization:
By projecting the data onto the first few linear discriminants, we can visualize the data in a lower-dimensional space, which can be helpful for understanding and interpreting the data.
Feature selection:
By keeping only the top-ranked linear discriminants, we can select the most important features of the data and use them for further analysis or as input to a machine learning algorithm.
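
As a hedged sketch of using LDA both as a dimensionality reducer and as a classifier (assuming scikit-learn is available; the Wine dataset is used only as a convenient three-class example):

          from sklearn.datasets import load_wine
          from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
          from sklearn.model_selection import cross_val_score

          X, y = load_wine(return_X_y=True)

          # Project onto the linear discriminants (at most C - 1 = 2 for 3 classes)
          lda = LinearDiscriminantAnalysis(n_components=2)
          X_reduced = lda.fit_transform(X, y)   # shape (n_samples, 2)

          # The same model can also be used directly as a classifier
          scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
          print(X_reduced.shape, scores.mean())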

Limitations of LDA

LDA has several limitations, including:

  • LDA is a linear technique, which means it may not capture the complex nonlinear relationships in the data.
  • LDA assumes that each class is approximately Gaussian with a shared covariance matrix, which may not always be the case.
  • LDA is sensitive to the scaling of the features, so it is essential to standardize the data before applying LDA.

T-Distributed Stochastic Neighbor Embedding (t-SNE)

The t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm is an unsupervised machine learning technique for visualizing high-dimensional data. It is a nonlinear dimensionality reduction technique used to reduce the dimensions of a dataset while preserving its structure.

The purpose of t-SNE is to map high-dimensional data points into a two-dimensional or three-dimensional space to visualize them. t-SNE has been used in many fields, including computer vision, natural language processing, image recognition, and many more.

What is T-Distributed Stochastic Neighbor Embedding?

T-Distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction method commonly used for visualizing high-dimensional datasets.

The goal of t-SNE is to reduce the dataset's dimensionality while preserving the data's local structure. t-SNE works by constructing a probability distribution over pairs of points in the dataset and then finding a low-dimensional representation of the data that preserves these relationships.

How Does t-SNE Work?

The t-SNE algorithm works by converting high-dimensional data into a low-dimensional representation. It computes a similarity measure between each pair of data points in the original dataset, uses this similarity measure to construct a probability distribution over pairs of data points, and then finds a low-dimensional representation of the data that reproduces this distribution as closely as possible.

The t-SNE algorithm first converts the pairwise distances between data points in the original dataset into probabilities, using a Gaussian kernel, so that nearby points receive high probability and distant points receive low probability. It then defines an analogous probability distribution over pairs of points in the low-dimensional space using a Student's t-distribution, and adjusts the low-dimensional coordinates to minimize the Kullback-Leibler divergence between the two distributions, a measure of the difference between them. The resulting coordinates form the low-dimensional representation of the data.

The low-dimensional representation is then used to create a two- or three-dimensional data visualization. This visualization can be used to gain insights into the structure of the data.

How is t-SNE Applied?

t-SNE can be applied to any type of dataset, including images, text, or numerical data. First, the data must be preprocessed and scaled so that the features are centered around zero and have similar ranges of values. Once this is done, t-SNE can be used to reduce the dimensionality of the data while preserving its local structure.
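
A minimal sketch of this workflow (assuming scikit-learn is available; the digits dataset and the perplexity value are illustrative choices, not prescriptions):

          from sklearn.datasets import load_digits
          from sklearn.manifold import TSNE
          from sklearn.preprocessing import StandardScaler

          X, y = load_digits(return_X_y=True)   # 64-dimensional image features

          # Scale the features, then embed into two dimensions
          X_scaled = StandardScaler().fit_transform(X)
          embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
          print(embedding.shape)   # (n_samples, 2), ready for a scatter plot colored by y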

The resulting visualization can be used to explore the data and identify clusters or trends that might otherwise have been difficult or impossible to spot. t-SNE can also be used to visualize high-dimensional data in a low-dimensional space, such as two- or three-dimensional plots.

Such plots can help explore the relationships between different features in the dataset and how they interact, and the reduced representation can make it easier to analyze the data and identify patterns or clusters. The embedding can also be fed into interactive visualization tools that allow users to explore the data more intuitively.

In summary, the t-SNE algorithm computes pairwise similarities between data points in the original high-dimensional space, matches them with a probability distribution over pairs of points in a low-dimensional space, and returns the resulting two- or three-dimensional representation. This representation can be used to gain insights into the data, uncover patterns and trends, and identify clusters or outliers.

Evaluating the performance of dimensionality reduction algorithms

Dimensionality reduction algorithms are used to reduce the number of features or dimensions of a dataset while retaining as much information as possible. Dimensionality reduction can be helpful for data visualization and feature selection tasks.

However, it is essential to evaluate the performance of the dimensionality reduction algorithm to ensure that the desired outcome is achieved. We will discuss how to evaluate the performance of dimensionality reduction algorithms.

Measuring Performance

The performance of a dimensionality reduction algorithm is often measured through a downstream task: the reduced representation is fed into a classifier, and the classifier is scored with the usual metrics of accuracy, precision, recall, and F-measure. Accuracy is the percentage of correct predictions made by the classifier, and precision is the proportion of predicted positives that are actually positive.

Recall is the proportion of actual positives that are correctly identified. Lastly, the F-measure is the harmonic mean of precision and recall.
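
As a hedged sketch of this kind of downstream evaluation (assuming scikit-learn; the choice of PCA, two components, and logistic regression is purely illustrative):

          from sklearn.datasets import load_iris
          from sklearn.decomposition import PCA
          from sklearn.linear_model import LogisticRegression
          from sklearn.metrics import classification_report
          from sklearn.model_selection import train_test_split

          X, y = load_iris(return_X_y=True)
          X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

          # Fit the reduction on the training data only, then apply it to both splits
          pca = PCA(n_components=2).fit(X_train)
          clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)

          # Accuracy, precision, recall, and F-measure of the downstream classifier
          print(classification_report(y_test, clf.predict(pca.transform(X_test))))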

In addition to accuracy metrics, the time taken to perform the dimensionality reduction also needs to be considered. The computing time of dimensionality reduction is significant if the algorithm is used for real-time applications.

Data Preprocessing

Before evaluating the performance of a dimensionality reduction algorithm, it is crucial to preprocess the data. Preprocessing includes normalizing the data, removing outliers, and transforming the data if necessary. Evaluating every algorithm on the same preprocessed dataset ensures that the comparison between them is fair.

Cross-Validation

Cross-validation is a technique used to evaluate a model's performance. It involves repeatedly splitting the data into training and testing sets: the model is fit on the training portion and evaluated on the held-out portion, and the scores are averaged across the splits. This allows us to assess the model's ability to generalize to unseen data, which is important for estimating its true performance.
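
A minimal sketch of cross-validating a dimensionality reduction step together with a classifier (assuming scikit-learn; the pipeline components are illustrative):

          from sklearn.datasets import load_iris
          from sklearn.decomposition import PCA
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import cross_val_score
          from sklearn.pipeline import make_pipeline
          from sklearn.preprocessing import StandardScaler

          X, y = load_iris(return_X_y=True)

          # Standardize, reduce to 2 components, then classify; the whole pipeline
          # is refit inside each cross-validation fold, avoiding information leakage
          pipeline = make_pipeline(StandardScaler(), PCA(n_components=2),
                                   LogisticRegression(max_iter=1000))
          scores = cross_val_score(pipeline, X, y, cv=5)
          print(scores.mean(), scores.std())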

Hyperparameter Tuning

Most dimensionality reduction algorithms have several hyperparameters that must be tuned to achieve good performance. Hyperparameter tuning involves selecting the parameter values that maximize the algorithm's performance, using techniques such as grid search, random search, or Bayesian optimization.
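
As a hedged sketch of a grid search over one such hyperparameter, the number of retained components (reusing the pipeline idea from the previous snippet; the grid values are arbitrary):

          from sklearn.datasets import load_iris
          from sklearn.decomposition import PCA
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import GridSearchCV
          from sklearn.pipeline import make_pipeline
          from sklearn.preprocessing import StandardScaler

          X, y = load_iris(return_X_y=True)
          pipeline = make_pipeline(StandardScaler(), PCA(), LogisticRegression(max_iter=1000))

          # Try several values for the number of principal components
          param_grid = {"pca__n_components": [1, 2, 3, 4]}
          search = GridSearchCV(pipeline, param_grid, cv=5).fit(X, y)
          print(search.best_params_, search.best_score_)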

Visualization

Visualizing the results of a dimensionality reduction algorithm can be very useful in understanding its performance. Techniques such as t-SNE, PCA, and MDS can be used to plot the data points in two or three dimensions. Visualization can help us identify clusters in the data and any underlying patterns the algorithm may have missed.

Conclusion

In this tutorial, we introduced several dimensionality reduction methods and discussed how to evaluate their performance.

We discussed various metrics and techniques that can be used to measure the algorithm's performance, such as accuracy metrics, preprocessing, cross-validation, hyperparameter tuning, and visualization. Using these techniques, we can ensure that the algorithm performs as expected and that the desired outcome is achieved.


References and further reading